To get a better understanding of the cell line effect a clustering analysis on differentially expressed miRNAs in control cells was performed.The statistical analysis underlying the characterization as differentially expressed is shown as follows:
1.3.4 Cluster analysis
1.3.4.1 Theoretical background
Hierarchical clustering
Hierarchical clustering is a method to group data based on similarity in data structure and is mostly associated with heatmaps and dendrograms for data presentation. For gene/miRNA expression analysis genes/miRNAs typically represent rows and different samples represent columns. To find genes/miRNA that are similarly expressed, distances between genes/miRNAs are measured e.g. using euclidean distance:
\[
\begin{aligned}
d = \sqrt{\sum_{i=1}^n (gene1_{i}-gene2_{i})^2}
\end{aligned}
\]
With gene1i being the expression of gene1 in sample i and gene2i being the expression of gene2 in sample i. An example for 2 samples and 2 genes is shown below. Let gene expression of gene1 be 1.6 in sample1 and 0.5 in sample2 and expression of gene2 be -0.5 in sample1 and -1.9 in sample2:
\[
\begin{aligned}
d & = \sqrt{\sum_{i=1}^n (gene1_{i}-gene2_{i})^2} \\
& = \sqrt{(gene1_{1} - gene2_{1})^2 + (gene1_{2} - gene2_{2})^2} \\
& = \sqrt{(1.6 - (-0.5))^2 + (0.5 - (-1.9))^2} \\
& = \sqrt{2.1^2 + 2.4^2} \\
& = 3.2
\end{aligned}
\] Note: for 2 samples the euclidean distance between 2 genes/miRNAs is equal to the pythagorean theorem.
Other possible distance measures include:
- pearson correlation (parametric)
- spearman correlation (non-parametric)
- kendall correlation (non-parametric)
- manhattan distance (instead of summing up squared distances and taking the sqaure root, absolute values of difference are used)
- canberra distance
- binary distance
- minkowski distance
- maximum distance
In more complex designs, distances are calculated for each gene/miRNA pair and the pair with the smallest distance is merged into a cluster (cluster1). Subsequently distances are calculated again but with the previously generated cluster being treated as a new gene/miRNA entity. If the smallest distance is between 2 other genes/miRNas, they are merged into a new separate cluster, if the smallest distance is between some gene/miRNA and cluster1, this gene is clustered next to cluster1. This process is repeated until all genes/miRNas are part of a cluster. To visualize this process a dendrogram is drawn, showing different branches to indicate relationships. The length of the branches corresponds to the time when this cluster was formed, meaning the shorter the branches the closer the relationship between the genes/miRNAs.
There are different approaches to cluster similar samples together, the most prominent being, average linkage, single linkage and complete linkage. In average linkage the distance of single genes/miRNAs to the centroid of clusters is calculated and they are merged into the cluster with the least distance to the centroid. In single linkage the distance is calculated between genes/miRNAs and the closest point in each cluster. Analogously, in complete linkage, the distance is calculated between genes/miRNas and the furthest point in each cluster.
Finally, not only genes/miRNAs can be clustered but this process can also be applied to different samples to find related samples based on the expression pattern.
Note: The choice of distance measures and clustering algorithms is completely arbitrary, so the best method is chosen by looking at the representation of the data for each process. However, there are studies indicating that distances between genes/miRNAs can preferably be calculated by correlation whereas distances between samples typically are calculated using euclidean distance. (Publications raussuchen)
K-means clustering
K-means is a different approach to cluster similar data but with a different mathematical approach and the need of supervision of a researcher, whereas hierarchical clustering typically is an unsupervised method. In k-means clustering a number k of desired clusters is specified previously to the analysis (methods to find the best k are described in a later paragraph). For an example with k=3 clusters, three random data points are chosen as the center of a cluster. Afterwards distances from each point to the three randomly chosen centers are calculated and the points are assigned to the closest cluster. In the following, the mean for each cluster is calculated and the distance from each point to the mean of each cluster is calculated. If necessary, points are reassigned and this process is repeated until the samples in the cluster do not change anymore. This explanation refers to one iteration of the k-means algorithm. The number of iterations can also be prespecified, and for each iteration three new random points are chosen as the starting points for the clusters. For each iteration the sum of variances within the clusters is calculated and summarized over all clusters. The iteration with the least sum of variances is chosen as the optimal clustering result.
To find the optimal value of k, different approaches can be used. Sometimes it is evident from a plot, a previous PCA or hierarchical cluster analysis or other exploratory data analyses how many clusters are optimal. In other cases different values for k can be used in a trial and error approach and the reduction of variance can be plotted against k. In this elbow plot there is a point after which the reduction of variance only changes slightly. This point generally corresponds to the optimal k.
1.3.4.2 Analysis
The differentially expressed miRNAs were used in a combined approach using supervised kmeans to cluster cell lines (columns) with k=3 as indicated by PCA and unsupervised hierachical clustering to cluster miRNAs (rows). Furthermore hierarchical clustering was also applied in the subclusters split by kmeans. The results were visualized in a Heatmap to reduce complexity and show the quantitative properties of each miRNA as an overview (Fig. 4).
# Heatmap
# transform data for Heatmap format
dat_Heatmap <- dat %>%
filter(Irradiation == "control" & miRNA %in% one_way_ANOVA$miRNA) %>%
select(-c(Irradiation, log_exp)) %>%
spread(miRNA, expression) %>%
mutate(cell_line = str_replace_all(cell_line, "_","-"))
# scale data for Heatmap
dat_scaled <- dat_Heatmap %>%
column_to_rownames("ID") %>%
select(-cell_line) %>%
scale() %>%
t()
# set ID as rownames so colorbar works properly
dat_colorbar <- dat_Heatmap %>% select(c(ID, cell_line)) %>%
column_to_rownames("ID")
# define colors for colorbar to identify cell lines
colorbar <- HeatmapAnnotation(
df =dat_colorbar,
col = list(
cell_line = c(
"Met-1" = "#0073C2FF",
"Met-4" = "#EFC000FF",
"SCC-12"="#868686FF" ,
"SCC-13"="#CD534CFF",
"SCL-II"="#7AA6DCFF")
),
annotation_legend_param = list(
cell_line = list(nrow=1)
)
)
# define color for the Heatmap cells
col_fun = colorRamp2(c(-1.8, 0, 1.8), c(c("#6D9EC1", "white", "#E46726")))
# Draw Heatmap
Ht <- Heatmap(
dat_scaled,
col= col_fun,
top_annotation = colorbar,
row_split = 4,
column_title = c("A", "B", "C"),
border = T,
# split cell lines in 3 clusters according to PCA
column_km = 3,
# repeat km split 100 times to get a stable cluster
column_km_repeats = 100,
# define distance metric and clustering algorithm
clustering_method_row = "average", # average linkage
clustering_method_columns = "average",
clustering_distance_row = "pearson",
clustering_distance_column = "euclidean",
# plot cosmetics
rect_gp = gpar(col = "white",lty = 1, lwd = 1),
row_names_gp = gpar(fontsize = 10),
show_column_names = FALSE,
column_names_gp = gpar(fontsize = 10),
heatmap_legend_param = list(
title = "row Z-score",
at = seq(-2,2,by=1),
color_bar="continuous",
title_position ="topcenter",
legend_direction = "horizontal",
legend_width = unit(4, "cm")
))
# plot the final heatmap with annotation and legend
draw(Ht, annotation_legend_side = "bottom", heatmap_legend_side = "bottom")
The combined cluster analysis showed a clear separation of miRNA expression depending on the cell line context (A: SCC-12/SCC-13,B: SCL-II,C: Met-1/Met-4). The miRNAs were clustered in 4 groups. Cluster 1 was upregulated in cell line cluster A compared to cluster B and C. Cluster 2 was upregulated in cell line cluster B compared to cluster A and C. Cluster 4 was upregulated in cell line cluster C compared to A and B. Cluster 3 showed an intermediate expression of miRNAs in Cluster B and C with the lowest expression values observed in Cluster A.
Taken together these findings indicate distinct miRNA clusters representing a cell line specific epigenetic pattern and possibly reflecting differentiation, p53 status or other phenotypes associated with these patterns. To investigate the biological function of the miRNAs in each cluster gene set enrichment analysis was conducted to find enriched pathways targeted by these miRNAs (section 1.3.5).
1.3.5 Gene set enrichment analysis (GSEA) with RBiomirGS
To identify the summarized function of the differentially expressed miRNAs between cell lines, miRNA clusters obtained in 1.3.4.2 were further investigated. Three distinct clusters were selected - namely cluster 1A (representing miRNAs overexpressed in cell cluster 1 containing SCC12 and SCC13), cluster 2B (miRNAs overexpressed in cell line cluster 2 containing SCL-II) and cluster 4C (miRNAs overexpressed in cell line cluster 4 containg Met-1 and Met-4) - and data was extracted from the heatmap object:
# extract clusters out of Heatmap object
Ht_clusters <- extract_clusters(dat_scaled, Ht, sampleName = "ID", sampleClust = "cellCluster",
geneName = "miRNA", geneClust = "miRCluster")
# construct new data frame containing cluster information
dat_clusters <- left_join(filter(dat, Irradiation == "control"), Ht_clusters$cellCluster) %>%
left_join(Ht_clusters$miRCluster) %>% select(-c(Irradiation, cell_line)) %>%
mutate(miRNA = str_replace_all(miRNA,"let", "hsa-let"),
miRNA = str_replace_all(miRNA,"mir", "hsa-miR"))
Afterwards fold-changes and pvalues were calculated for upregulated miRNAs in each cell line cluster vs the remaining 2 (e.g. miRNAs in cluster 1A vs miRNAs in cluster 1B + 1C).
# create list for each miRCluster
ls_miRCluster <- split(dat_clusters, f = dat_clusters$miRCluster)
# compare Cluster 1A to 1B and 1C, drop attributes (important for the pathway analysis as a list threw an error) and generate a data.frame
cl_1A <- as.data.frame(lapply(summary_clusters(ls_miRCluster, 1, "A"), FUN=drop_attr))
# compare Cluster2B to 2A and 2C, drop attributes and generate a data.frame
cl_2B <- as.data.frame(lapply(summary_clusters(ls_miRCluster, 2, "B"), FUN=drop_attr))
# compare Cluster 4C to 4A and 4B, drop attributes and generate a data.frame
cl_4C <- as.data.frame(lapply(summary_clusters(ls_miRCluster, 4, "C"), FUN=drop_attr))
These dataframes containing the miRNA name, the fold-change and pvalue of the specified cell line cluster vs the other two for each miRNA were then used for GSEA with the R package RBiomirGS providing a comprehensive tool for quantitative investigation of miRNA-mRNA interactions (Zhang 2018). Sources for gene sets used in this analyses were KEGG (Kyoto Encyclopedia of Genes and Genomes) and GO:BP (Gene Ontology: Biological Process) obtained from http://www.gsea-msigdb.org/gsea/msigdb/index.jsp as entrez IDs stored in .gmt files. The metric for the GSEA in this package is logistic regression with a negative model coefficient related to gene sets inhibited by the miRNA set under investigation and a positive model coefficient associated with gene sets activated by the miRNA set.
Bioinformatic approaches are prone to false-positive results (Godard 2015) due to annotation and research bias concerning miRNAs and pathways: On the one hand certain miRNAs are researched more extensively leading to a higher number of identified targets, on the other hand there are certain pathways - mostly associated with cancer - which are highly related and therefore enriched in many studies even though different approaches were used in those studies. Thus, the need for negative controls is crucial to eliminate this bias.
In this study three clusters should be examined with a total of 30 miRNAs (cluster 1A: 10 miRNAs, cluster 2B: 6 miRNAs, cluster 4C: 14 miRNAs) resulting in an average number of 10 miRNAs in each cluster. Therefore 10 random mature miRNAs were chosen out of a total of 2635 miRNAs obtained from miRBase (www.mirbase.org) as controls. These miRNAs were assigned random FC and pvalue pairs out of the original analysis as this is required for the performed quantitative analysis. This process was repeated 50 times to ensure a sufficiently large number of control miRNA sets. As this iterative process is time consuming, once the control data was generated it was stored in an .rds object to be accessed later. To assess bias, the number of times a pathway was enriched in these 50 control simulations was stored and divided by 50 to obtain an enrichment percentage. Pathways that were enriched in more than 10 % of controls were then omitted from the analysis to reduce bias.
1.3.5.1 KEGG pathway analysis
Calculation of controls for KEGG analysis:
# load data with mature miRNAs from miRBase
dat_miRBase <- read.csv("Data/Pathway Analysis/mature_miRNA.csv", header = FALSE) %>%
mutate(miRNA = str_extract(V1, "(hsa-miR|hsa-let)-\\d{1,5}([:alpha:]+-\\dp|\\s|-\\dp|[:alpha:]+)")) %>%
na.omit()
# data frame containing FC and pvalue of the original data used for random sampling in controls
ctrl_stats <- rbind(cl_1A, cl_2B, cl_4C) %>%
select(-miRNA)
# calculate GS enrichment for 50 random controls (containing 10 miRNAs each)
n <- 50
# ctrl_list_KEGG <- GS_controls(data= ctrl_stats,rep = n, miRdata = dat_miRBase$miRNA, sample_n = 10,
# gs_file = "c2.cp.kegg.v7.2.entrez.gmt")
# to facilitate downstream analysis the ctrl_list is saved
# saveRDS(ctrl_list, file = "ctrl_list.rds")
ctrl_list_KEGG <- readRDS(file = "Data/Pathway Analysis/ctrl_list_KEGG.rds") # file is available at https://github.com/MBender1992/PhD/blob/Marc/Data/Pathway%20Analysis/ctrl_list_KEGG.rds
# calculate bias and mean targeted genes
res_ctrl_KEGG <- pathway_ctrl_summary(ctrl_list_KEGG, n=n)
Cluster 1A
Comparison of miRNAs upregulcl_1A_predicted_mirna_mrna_iwls_KEGGated in SCC-12/SCC-13 vs SCL-II, Met-1 and Met-4.
# calculation of GSEA for miRNAs upregulated in cluster 1A
# collect mRNA targets of the miRNAs upregulated in cluster 1A
rbiomirgs_mrnascan_custom(objTitle = "cl_1A_predicted", mir = cl_1A$miRNA, sp = "hsa",
queryType = "predicted",parallelComputing = TRUE,clusterType = "PSOCK")
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_1A_predicted_mirna_mrna_iwls_KEGG",mirna_DE = cl_1A,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_1A_predicted_mrna_entrez_list,
mrna_Weight = NULL, gs_file = "Data/Pathway Analysis/c2.cp.kegg.v7.2.entrez.gmt", optim_method = "IWLS",
p.adj = "fdr", parallelComputing = FALSE, clusterType = "PSOCK")
# total number of significantly enriched pathways
sum(cl_1A_predicted_mirna_mrna_iwls_KEGG_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 %
bias_KEGG <- names(res_ctrl_KEGG$bias[res_ctrl_KEGG$bias > 0.1])
cl_1A_KEGG_plot <- cl_1A_predicted_mirna_mrna_iwls_KEGG_GS %>% filter(!GS %in% bias_KEGG)
In total 13 (12 after bias correction) KEGG pathways were significantly enriched in the gene set targeted by the 10 miRNAs in cluster 1A. One pathway (KEGG_Glioma) was excluded from the downstream analysis as it was enriched in more than 10 % of control simulations (also for the analysis in cluster 2B and cluster 4C).
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_1A_KEGG_plot,topgsLabel = TRUE,n = 15,gsLabelSize = 2,
sigColour = "red",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_1A_KEGG_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_1A_KEGG_plot,signif_only = 15,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = 15, plotWidth = 250, plotHeight = 220)
Interpretation???
Cluster 2B
Comparison of miRNAs upregulated in SCL-II vs. SCC-12, SCC-13 Met-1 and Met-4.
# collect mRNA targets of the miRNAs upregulated in cluster 2B
rbiomirgs_mrnascan_custom(objTitle = "cl_2B_predicted", mir = cl_2B$miRNA, sp = "hsa",
queryType = "predicted",parallelComputing = TRUE,clusterType = "PSOCK")
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_2B_predicted_mirna_mrna_iwls_KEGG",mirna_DE = cl_2B,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_2B_predicted_mrna_entrez_list,
mrna_Weight = NULL, gs_file = "Data/Pathway Analysis/c2.cp.kegg.v7.2.entrez.gmt", optim_method = "IWLS",
p.adj = "fdr", parallelComputing = FALSE, clusterType = "PSOCK")
# total number of significantly enriched pathways
sum(cl_2B_predicted_mirna_mrna_iwls_KEGG_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 %
cl_2B_KEGG_plot <- cl_2B_predicted_mirna_mrna_iwls_KEGG_GS %>% filter(!GS %in% bias_KEGG)
In total 34 (33 after bias correction) KEGG pathways were significantly enriched in cluster 2B.
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_2B_KEGG_plot,topgsLabel = TRUE,n = 15,gsLabelSize = 2,
sigColour = "red",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_2B_KEGG_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_2B_KEGG_plot,signif_only = T,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = "all", plotWidth = 250, plotHeight = 220)
Interpretation???
Cluster 4C
Comparison of miRNAs upregulated in Met-1 and Met-4 vs. SCC-12, SCC-13 and SCL-II.
# collect mRNA targets of the miRNAs upregulated in cluster 4C
rbiomirgs_mrnascan_custom(objTitle = "cl_4C_predicted", mir = cl_4C$miRNA, sp = "hsa",
queryType = "predicted",parallelComputing = TRUE,clusterType = "PSOCK")
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_4C_predicted_mirna_mrna_iwls_KEGG",mirna_DE = cl_4C,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_4C_predicted_mrna_entrez_list,
mrna_Weight = NULL, gs_file = "Data/Pathway Analysis/c2.cp.kegg.v7.2.entrez.gmt", optim_method = "IWLS",
p.adj = "fdr", parallelComputing = FALSE, clusterType = "PSOCK")
# total number of significantly enriched pathways
sum(cl_4C_predicted_mirna_mrna_iwls_KEGG_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 %
cl_4C_KEGG_plot <- cl_4C_predicted_mirna_mrna_iwls_KEGG_GS %>% filter(!GS %in% bias_KEGG)
In total 32 (31 after bias correction) KEGG pathways were significantly enriched in cluster 4C.
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_4C_KEGG_plot,topgsLabel = TRUE,n = 15,gsLabelSize = 2,
sigColour = "red",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_4C_KEGG_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_4C_KEGG_plot,signif_only = 15,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = "all", plotWidth = 250, plotHeight = 220)
Interpretation???
1.3.5.2 GO:BP analysis
Calculation of controls for GO:BP analysis:
# calculate GS enrichment for 50 random controls
n <- 50
# ctrl_list_GO_BP <- GS_controls(data= ctrl_stats,rep = n, miRdata = dat_miRBase$miRNA, sample_n = 10,
# gs_file = "c5.go.bp.v7.2.entrez.xls")
# to facilitate downstream analysis the ctrl_list is saved
# saveRDS(ctrl_list_GO_BP, file = "ctrl_list_GO_BP.rds")
ctrl_list_GO_BP<- readRDS(file = "Data/Pathway Analysis/ctrl_list_GO_BP.rds") # file is available at https://github.com/MBender1992/PhD/blob/Marc/Data/Pathway%20Analysis/ctrl_list_GO_BP.rds
# calculate bias and mean targeted genes
res_ctrl_GO_BP <-pathway_ctrl_summary(ctrl_list_GO_BP, n=n)
Cluster 1A
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_1A_predicted_mirna_mrna_iwls_GO_BP",mirna_DE = cl_1A,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_1A_predicted_mrna_entrez_list,
mrna_Weight = NULL,gs_file = "Data/Pathway Analysis/c5.go.bp.v7.2.entrez.xls",optim_method = "IWLS",
p.adj = "fdr",parallelComputing = FALSE,clusterType = "PSOCK")
# number of enriched GO terms
sum(cl_1A_predicted_mirna_mrna_iwls_GO_BP_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 % and only keep pathways that converged
bias_GO_BP <- names(res_ctrl_GO_BP$bias[res_ctrl_GO_BP$bias > 0.1])
cl_1A_Go_BP_plot <- cl_1A_predicted_mirna_mrna_iwls_GO_BP_GS %>% filter(!GS %in% bias_GO_BP & converged == "Y")# number of significantly enriched pathways
In total 501 (442 after bias correction) GO terms (biological process) were significantly enriched. 59 Go terms were randomly enriched in more than 10 % of control simulations and omitted from the analysis (also for cluster 2B and 4C).
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_1A_GO_BP_plot,topgsLabel = TRUE,n = 5,gsLabelSize = 1.8,
sigColour = "#CD534CFF",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_1A_GO_BP_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_1A_GO_BP_plot,signif_only = 15,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = 50, plotWidth = 250, plotHeight = 220)
Cluster 2B
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_2B_predicted_mirna_mrna_iwls_GO_BP",mirna_DE = cl_2B,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_2B_predicted_mrna_entrez_list,
mrna_Weight = NULL,gs_file = "Data/Pathway Analysis/c5.go.bp.v7.2.entrez.xls",optim_method = "IWLS",
p.adj = "fdr",parallelComputing = FALSE,clusterType = "PSOCK")
# number of enriched GO terms
sum(cl_2B_predicted_mirna_mrna_iwls_GO_BP_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 % and only keep pathways that converged
cl_2B_Go_BP_plot <- cl_2B_predicted_mirna_mrna_iwls_GO_BP_GS %>% filter(!GS %in% bias_GO_BP & converged == "Y")# number of significantly enriched pathways
In total 525 (466 after bias correction) GO terms were enriched in cluster 2B.
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_2B_GO_BP_plot,topgsLabel = TRUE,n = 10,gsLabelSize = 1.8,
sigColour = "#CD534CFF",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_2B_GO_BP_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_2B_GO_BP_plot,signif_only = T,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = 50, plotWidth = 250, plotHeight = 220)
Cluster 4C
# calculate GS by logistic regression
rbiomirgs_logistic(objTitle = "cl_4C_predicted_mirna_mrna_iwls_GO_BP",mirna_DE = cl_4C,
var_mirnaName = "miRNA",var_mirnaFC = "FC",var_mirnaP = "pvalue", mrnalist = cl_4C_predicted_mrna_entrez_list,
mrna_Weight = NULL,gs_file = "Data/Pathway Analysis/c5.go.bp.v7.2.entrez.xls",optim_method = "IWLS",
p.adj = "fdr",parallelComputing = FALSE,clusterType = "PSOCK")
# number of enriched GO terms
sum(cl_4C_predicted_mirna_mrna_iwls_GO_BP_GS$adj.p.val < 0.05)
# remove enriched pathways with a random enrichment of more than 10 % and only keep pathways that converged
cl_4C_Go_BP_plot <- cl_4C_predicted_mirna_mrna_iwls_GO_BP_GS %>% filter(!GS %in% bias_GO_BP & converged == "Y")# number of significantly enriched pathways
In total 766 (707 after bias correction) GO terms were enriched in cluster 4C.
#plot results (volcano plot)
rbiomirgs_volcano(gsadfm = cl_4C_GO_BP_plot,topgsLabel = TRUE,n = 10,gsLabelSize = 1.8,
sigColour = "#CD534CFF",plotWidth = 250,plotHeight = 220,xLabel = "model coefficient")
# plot distribution of enriched gene sets
rbiomirgs_bar(gsadfm = cl_4C_GO_BP_plot,signif_only = F,gs.name = F,
n = "all",xLabel = "gene set", yLabel = "model coefficient", plotWidth = 250, plotHeight = 220)
# plot top enriched gene sets
rbiomirgs_bar(gsadfm = cl_4C_GO_BP_plot,signif_only = T,gs.name = T,xLabel = "model coefficient",
yTxtSize = 7, n = 50, plotWidth = 250, plotHeight = 220)